library(readr)
library(dplyr)
library(ggplot2)
library(methods)
Today we will look at an image dataset. The data consists of photos of flowers. There are 17 types of flowers and the task is to recognize the flower type from the image. We will look at just 10 of the classes in today’s notes; your lab will work with the entire set of 17. If you are curious, the original paper describing the dataset can be found here:
It was constructed by the Visual Geometry Group (VGG) at Oxford University. Keep this in mind as that name will come up again in our study of image processing.
We need to read image data into R in two steps. First, we load the metadata. This is exactly the same as any other dataset in which we pull in a CSV file from GitHub:
flowers <- read_csv("https://statsmaths.github.io/ml_data/flowers_17.csv")
flowers
## # A tibble: 1,360 x 4
## obs_id train_id class class_name
## <chr> <chr> <dbl> <chr>
## 1 id_000362 valid 4 crocus
## 2 id_000506 train 6 tigerlily
## 3 id_000778 valid 9 sunflower
## 4 id_001233 train 15 windflower
## 5 id_000274 train 3 bluebell
## 6 id_001218 train 15 windflower
## 7 id_001280 test NA <NA>
## 8 id_000895 train 11 colts foot
## 9 id_000851 valid 10 daisy
## 10 id_000084 train 1 snowdrop
## # … with 1,350 more rows
Then, we also have to grab the image data itself. To do this, first download the dataset here:
Save it somewhere on your computer and then read it into R:
x64 <- read_rds("image_data/flowers_17_x64.rds")
What is this object? It is something similar to a matrix, called an array:
class(x64)
## [1] "array"
However, unlike a matrix, which has only two dimensions (the number of rows and the number of columns), an array can have an arbitrary number of dimensions. This dataset, and all of the other image datasets that we will look at, has four dimensions:
dim(x64)
## [1] 1360 64 64 3
These four dimensions correspond to: the number of images (1360), the height of each image in pixels (64), the width of each image in pixels (64), and the three color channels (red, green, and blue).
Each entry in the array is a number between 0 and 1 that records how strongly each of the three colors is represented at that pixel. For example, here are the first five rows and columns of the 40th image:
x64[40, 1:5, 1:5,]
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.19873813 0.07141544 0.04626417 0.03853209 0.17168160
## [2,] 0.12598039 0.06516544 0.15939989 0.27371515 0.18682598
## [3,] 0.10089231 0.15621170 0.10606809 0.03233188 0.07957453
## [4,] 0.09633310 0.10986711 0.14794730 0.39114392 0.03502413
## [5,] 0.04144263 0.01869064 0.15061083 0.20364583 0.00000000
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.20216950 0.12947304 0.1345971 0.10588235 0.20620979
## [2,] 0.14754519 0.10125613 0.1737611 0.27414599 0.23100873
## [3,] 0.15138059 0.19845473 0.1269838 0.05848269 0.10969095
## [4,] 0.12412109 0.15431985 0.1867724 0.48034046 0.07640357
## [5,] 0.08278186 0.03210784 0.1859337 0.26030178 0.01905637
##
## , , 3
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.13377566 0.08318015 0.02522021 0.03841912 0.07223307
## [2,] 0.12190181 0.17330538 0.22007123 0.19203623 0.18902420
## [3,] 0.04496017 0.32266965 0.52462852 0.57953240 0.50277076
## [4,] 0.06002604 0.12204733 0.30017233 0.38515242 0.18935547
## [5,] 0.04844707 0.61023859 0.65716337 0.62645144 0.57942517
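As a quick sanity check on that scale, the range over the whole array should span from 0 up to (at most) 1:
range(x64)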
We will explore this data more throughout today’s notes.
I only want to look at the first 10 types of flowers in today’s notes. Your task for the next lab will be to look at all 17 types.
x64 <- x64[flowers$class %in% 0:9,,,]
flowers <- flowers[flowers$class %in% 0:9,]
fnames <- flowers$class_name[match(0:9, flowers$class)]
fnames <- factor(fnames, levels = fnames)
Class “1” consists of the snowdrop flower. Let’s look at a few of the images to get a sense of what this flower looks like:
par(mar = c(0,0,0,0))
par(mfrow = c(3, 4))
set.seed(1)
for (i in sample(which(flowers$class == 1), 12)) {
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
rasterImage(x64[i,,,],0,0,1,1)
}
We can look at these and instantly see the similarities. The difficulty is going to be teaching the computer to understand these as well. Let’s now look at a representative flower from all 10 classes:
par(mar = c(0,0,0,0))
par(mfrow = c(3, 4))
set.seed(1)
for (i in 0:9) {
plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
j <- sample(which(flowers$class == i), 1)
rasterImage(x64[j,,,],0,0,1,1)
text(0.5, 0.1, flowers$class_name[j], cex = 3, col = "salmon")
}
Notice that color will be useful for telling some of these apart, but not sufficient for distinguishing all classes. Crocuses and irises, for example, have very similar colors.
To start, we use a trick to flatten the dataset from an array into a matrix. This just unravels each image’s pixel values into one long row.
X <- t(apply(x64, 1, cbind))
y <- flowers$class
X_train <- X[flowers$train_id == "train",]
y_train <- y[flowers$train_id == "train"]
dim(X_train)
## [1] 400 12288
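As an aside, there is a more direct way to get the same kind of flattening. R stores arrays in column-major order with the first index varying fastest, so reshaping with matrix() puts one image per row. Here is a sketch of that alternative (the column ordering may differ from the apply trick, which makes no difference to the models below):
# reshape the 4-dimensional array into a matrix with one image per row
X_alt <- matrix(x64, nrow = dim(x64)[1])
dim(X_alt)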
Why does the matrix X have so many columns? Each image contributes 64 * 64 * 3 values, which yields 12288 total variables. Notice how quickly the number of components in an image grows: if we had an HD image, this would require over 2.7 million values!
\[ 1280 \times 720 \times 3 = 2\,764\,800 \]
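We can verify both of these counts directly in R:
64 * 64 * 3
## [1] 12288
1280 * 720 * 3
## [1] 2764800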
Let’s apply the elastic net to our dataset (it is really the only technique we know that can handle so many variables):
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
model <- cv.glmnet(X_train, y_train, family = "multinomial", nfolds = 3)
beta <- coef(model, s = model$lambda.1se)
beta <- Reduce(cbind, beta)
dim(beta[apply(beta != 0, 1, any),])
## [1] 359 10
The resulting model, even at lambda.1se, has over 300 non-zero components. Evaluating the model, we see that it heavily overfits the training data:
pred <- predict(model, newx = X, type = "class")
tapply(pred == y, flowers$train_id, mean)
## train valid
## 0.9325 0.3600
There are ten classes here, so a 36% classification rate is not terrible. I think we can probably do better, though!
One difficulty with using the red, green, and blue pixel values is that these do not map very well into a “natural” meaning of color. They are useful for digital screens to display images but not ideal for much else.
Instead, we can use a different color space model that translates red, green, and blue into a different set of variables. One popular choice in statistical learning is the hue, saturation, value (HSV) space. Each of these three values ranges from 0 to 1. A good picture helps a lot in understanding what the values mean:
Value tells how dark a pixel is, saturation how much color it has (with a low value being close to grey), and hue gives the specific point on a color wheel. Usually a hue of 0 indicates red. Notice that hue is a circular variable, so that a hue of 0.99 is close to a hue of 0.01.
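As a tiny illustration of that circularity, the hsv function (which converts HSV coordinates into a hex color, and which we will use more below) gives nearly identical reds for hues just above 0 and just below 1:
# both of these are within a few steps of pure red (#FF0000)
hsv(c(0.01, 0.99), 1, 1)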
We can convert into HSV space with the base R function rgb2hsv:
i <- 3
red <- as.numeric(x64[i,,,1])
green <- as.numeric(x64[i,,,2])
blue <- as.numeric(x64[i,,,3])
hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))
head(hsv)
## h s v
## [1,] 0.5681818 0.3235306 0.2666657
## [2,] 0.5681818 0.2404577 0.3587929
## [3,] 0.5733333 0.2531846 0.3872243
## [4,] 0.5733333 0.2492254 0.3933757
## [5,] 0.5802469 0.2573864 0.4113750
## [6,] 0.5808081 0.3063756 0.4223958
To make sure we understand exactly what these values mean, let’s plot some colors in R. The hsv function maps a set of HSV coordinates to a hex color string. Here we look at a set of grey colors with varying values (hue is set to 1 and saturation to zero), as well as a set of 10 hues where saturation and value are both set to 1.
color_vals <- c(hsv(1, 0, seq(0, 1, by = 0.2)),
hsv(seq(0, 0.9, by = 0.1), 1, 1))
plot(seq_along(color_vals), seq_along(color_vals),
col = color_vals, pch = 19, cex = 5)
A good trick with HSV space is to discretize the pixels into a small set of fixed colors. We will start by using the buckets defined in the previous plot.
We will do that by creating a vector called color, set to “#000000” (pure black), and then changing the color depending on the HSV coordinates. If the saturation is less than 0.2, the pixel is too washed out to assign a reasonable color, so we instead set it to a shade of grey determined by its value, rounded to the nearest fifth. If the saturation is higher than 0.2 and the value is also higher than 0.2 (i.e., the pixel is not too dark), we bucket the hue into ten buckets. Pixels with a low value are all kept at the default of black.
color <- rep("#000000", nrow(hsv))
index <- which(hsv[,2] < 0.2)
color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)
index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
color[index] <- hsv(round(hsv[index,1],1), 1, 1)
table(factor(color, levels = color_vals))
##
## #000000 #333333 #666666 #999999 #CCCCCC #FFFFFF #FF0000 #FF9900 #CCFF00
## 124 32 160 13 128 7 13 2870 83
## #33FF00 #00FF66 #00FFFF #0066FF #3300FF #CC00FF #FF0099
## 28 3 22 611 2 0 0
For this one test image, we see that the dominant color is “#FF9900”, an orange, followed by “#0066FF”, a blue.
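If we want to identify the dominant colors programmatically rather than scanning the table by eye, one option (a quick sketch) is to sort the counts:
# top three colors for this image, most common first
sort(table(factor(color, levels = color_vals)), decreasing = TRUE)[1:3]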
We can use these counts as features to tell us about a given flower. Let’s cycle over the entire dataset and grab these features.
X_hsv <- matrix(0, ncol = length(color_vals),
nrow = nrow(flowers))
for (i in seq_len(nrow(flowers))) {
red <- as.numeric(x64[i,,,1])
green <- as.numeric(x64[i,,,2])
blue <- as.numeric(x64[i,,,3])
hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))
color <- rep("#000000", nrow(hsv))
index <- which(hsv[,2] < 0.2)
color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)
index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
color[index] <- hsv(round(hsv[index,1],1), 1, 1)
X_hsv[i,] <- table(factor(color, levels = color_vals))
}
head(X_hsv)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 435 191 283 207 280 175 100 579 752 105 0 0 0
## [2,] 0 370 731 135 7 19 957 611 42 165 972 51 11
## [3,] 124 32 160 13 128 7 13 2870 83 28 3 22 611
## [4,] 94 42 299 552 223 117 12 245 1754 14 0 0 1
## [5,] 2825 232 32 41 245 144 2 188 214 8 1 0 4
## [6,] 415 114 222 193 180 434 135 121 305 903 259 73 152
## [,14] [,15] [,16]
## [1,] 35 832 122
## [2,] 5 1 19
## [3,] 2 0 0
## [4,] 404 335 4
## [5,] 25 88 47
## [6,] 486 36 68
The 8th column is the orange color and the 9th a yellow-green, both common because of the flowers themselves and the background greenery.
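Since the columns line up with the entries of color_vals, we can confirm which colors these are:
color_vals[c(8, 9)]
## [1] "#FF9900" "#CCFF00"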
We can use this new matrix to fit another elastic net. The matrix is small enough that we could use other techniques too, but I’ll keep it consistent here.
y <- flowers$class
X_train <- X_hsv[flowers$train_id == "train",]
X_valid <- X_hsv[flowers$train_id == "valid",]
y_train <- y[flowers$train_id == "train"]
y_valid <- y[flowers$train_id == "valid"]
library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial")
beta <- coef(model, s = model$lambda.1se)
beta <- Reduce(cbind, beta)
colnames(beta) <- fnames
rownames(beta) <- c("(intercept)", color_vals)
as.matrix(beta)
## daffodil snowdrop lily valley bluebell
## (intercept) 1.770461e-01 -1.489402e-02 1.119792e+00 0.2156050036
## #000000 0.000000e+00 2.393722e-04 9.244839e-05 -0.0001356352
## #333333 1.292036e-03 3.897117e-05 -8.897221e-04 -0.0009465659
## #666666 0.000000e+00 0.000000e+00 1.657749e-04 0.0000000000
## #999999 0.000000e+00 1.087899e-03 0.000000e+00 0.0000000000
## #CCCCCC -1.788616e-03 1.068943e-03 0.000000e+00 0.0000000000
## #FFFFFF -1.343429e-03 1.447499e-03 0.000000e+00 0.0000000000
## #FF0000 0.000000e+00 0.000000e+00 0.000000e+00 0.0000000000
## #FF9900 6.590324e-04 0.000000e+00 -1.317518e-03 -0.0005517535
## #CCFF00 0.000000e+00 0.000000e+00 0.000000e+00 0.0000000000
## #33FF00 -3.320615e-05 -2.903422e-04 3.651323e-04 0.0000000000
## #00FF66 0.000000e+00 0.000000e+00 0.000000e+00 0.0001266336
## #00FFFF 2.898429e-04 3.269193e-04 4.960677e-04 0.0000000000
## #0066FF 1.597266e-04 0.000000e+00 -7.731947e-04 0.0000000000
## #3300FF 0.000000e+00 0.000000e+00 0.000000e+00 0.0031461614
## #CC00FF 0.000000e+00 0.000000e+00 0.000000e+00 0.0020355444
## #FF0099 0.000000e+00 0.000000e+00 0.000000e+00 0.0000000000
## crocus iris tigerlily tulip
## (intercept) 2.056405e-01 1.424085e-01 -0.2184719710 1.1743047056
## #000000 3.070082e-05 -3.567885e-04 0.0000000000 0.0002189562
## #333333 2.607157e-04 -3.664088e-03 0.0000000000 0.0000000000
## #666666 0.000000e+00 -3.545157e-04 0.0000000000 0.0000000000
## #999999 9.013507e-04 -8.905244e-05 0.0000000000 0.0000000000
## #CCCCCC 1.965885e-04 1.602054e-03 0.0000000000 -0.0042881467
## #FFFFFF 7.882126e-04 0.000000e+00 0.0000000000 -0.0017858055
## #FF0000 0.000000e+00 1.090692e-03 0.0032937491 0.0000000000
## #FF9900 0.000000e+00 0.000000e+00 0.0000000000 0.0000000000
## #CCFF00 -2.136064e-04 2.265112e-04 0.0000000000 0.0000000000
## #33FF00 -4.583371e-04 0.000000e+00 0.0000000000 0.0000000000
## #00FF66 -1.076706e-03 0.000000e+00 0.0000000000 0.0000000000
## #00FFFF -2.047303e-04 0.000000e+00 0.0000000000 -0.0007464020
## #0066FF -2.904192e-04 -7.562564e-05 0.0004041505 -0.0005402705
## #3300FF 2.140144e-03 2.184012e-03 0.0000000000 0.0000000000
## #CC00FF 2.745270e-03 1.326558e-03 0.0000000000 0.0000000000
## #FF0099 0.000000e+00 0.000000e+00 0.0000000000 0.0000000000
## fritillary sunflower
## (intercept) -1.398503899 -1.4029265752
## #000000 0.000000000 0.0000000000
## #333333 0.000000000 -0.0005368208
## #666666 0.000000000 0.0000000000
## #999999 0.000000000 0.0000000000
## #CCCCCC 0.000000000 -0.0004132294
## #FFFFFF 0.000000000 0.0000000000
## #FF0000 0.002407056 0.0000000000
## #FF9900 0.000000000 0.0016435783
## #CCFF00 0.000000000 0.0000000000
## #33FF00 0.000000000 0.0000000000
## #00FF66 0.000000000 0.0000000000
## #00FFFF 0.000000000 0.0011926989
## #0066FF 0.000000000 0.0007580761
## #3300FF 0.000000000 0.0000000000
## #CC00FF 0.000000000 0.0000000000
## #FF0099 0.009268888 0.0000000000
We see some expected patterns here. Snowdrops have a large white coefficient (“#FFFFFF”) and bluebells have large values for blue (“#3300FF”) and purple (“#CC00FF”). Sunflowers have a large coefficient for orange (“#FF9900”).
This model is slightly more predictive but, more importantly, is not nearly as overfit.
pred <- as.numeric(predict(model, newx = X_hsv,
type = "class"))
tapply(pred == y, flowers$train_id, mean)
## train valid
## 0.5525 0.4100
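To see which pairs of flowers the model mixes up, a confusion matrix on the validation set is a natural next step; here is a quick sketch using the objects defined above:
# rows give the true class, columns the predicted class
pred_valid <- pred[flowers$train_id == "valid"]
table(actual = fnames[y_valid + 1], predicted = fnames[pred_valid + 1])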
The reason for this is that the first elastic net likely approximated the kind of analysis we did here, but in doing so it overfit to the way hue, saturation, and value looked on the training data.
We can improve our model by including more colors. We don’t need any more greys, but let’s include a set of 100 hues. This will give us more information about the particular colors of each flower.
color_vals <- c(hsv(1, 0, seq(0, 1, by = 0.2)),
hsv(seq(0, 0.99, by = 0.01), 1, 1))
X_hsv <- matrix(0, ncol = length(color_vals),
nrow = nrow(flowers))
for (i in seq_len(nrow(flowers))) {
red <- as.numeric(x64[i,,,1])
green <- as.numeric(x64[i,,,2])
blue <- as.numeric(x64[i,,,3])
hsv <- t(rgb2hsv(red, green, blue, maxColorValue = 1))
color <- rep("#000000", nrow(hsv))
index <- which(hsv[,2] < 0.2)
color[index] <- hsv(1, 0, round(hsv[index,3] * 5) / 5)
index <- which(hsv[,2] > 0.2 & hsv[,3] > 0.2)
color[index] <- hsv(round(hsv[index,1], 2), 1, 1)
X_hsv[i,] <- table(factor(color, levels = color_vals))
}
We will use the elastic net again here. With the increased set of colors, let’s set alpha to 0.2 to spread the weights out over similar colors.
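As a reminder, for each candidate lambda, glmnet penalizes the coefficients by a mixture of the lasso and ridge penalties,
\[ \lambda \left( \alpha \cdot \| \beta \|_1 + \frac{1 - \alpha}{2} \cdot \| \beta \|_2^2 \right), \]
so lowering alpha toward zero behaves more like ridge regression and shrinks correlated columns (here, neighboring hues) together rather than selecting just one of them.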
y <- flowers$class
X_train <- X_hsv[flowers$train_id == "train",]
X_valid <- X_hsv[flowers$train_id == "valid",]
y_train <- y[flowers$train_id == "train"]
y_valid <- y[flowers$train_id == "valid"]
library(glmnet)
model <- cv.glmnet(X_train, y_train, family = "multinomial",
alpha = 0.2)
The model is more predictive with the more granular color labels:
pred <- as.numeric(predict(model, newx = X_hsv, type = "class"))
tapply(pred == y, flowers$train_id, mean)
## train valid
## 0.6075 0.4300
We can create an interesting visualization of these values by showing the weights as a function of the actual colors for each flower.
beta <- coef(model, s = model$lambda.min)
beta <- as.matrix(Reduce(cbind, beta))[-1,]
colnames(beta) <- fnames
rownames(beta) <- color_vals
df <- tibble(flower = rep(colnames(beta), each = nrow(beta)),
color = rep(rownames(beta), ncol(beta)),
beta = as.numeric(beta))
cols <- color_vals; names(cols) <- color_vals
df$color <- factor(df$color, levels = color_vals)
filter(df, beta > 0) %>%
ggplot(aes(color, flower)) +
geom_point(aes(color = color, size = beta), show.legend = FALSE) +
theme_minimal() +
scale_colour_manual(values = cols) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
Tigerlilies are very red, whereas bluebells, crocuses, and irises have a blue/purple color.
At the very least, I think it looks visually neat, even if it is not particularly helpful from a predictive standpoint.